Abstract - In this paper, we propose a novel method for discovering
characteristic patterns in time series. This method, called SAX-VSM,
is based on two existing techniques: Symbolic Aggregate approximation
and Vector Space Model. SAX-VSM automatically discovers and ranks
time series patterns by their importance to the class, which not only
creates well-performing classifiers and facilitates clustering, but
also provides an interpretable class generalization. The accuracy of
the method, as shown through experimental evaluation, is at the level
of the current state of the art. While relatively computationally
expensive in the learning phase, our method provides fast, precise,
and interpretable classification.

I.  Introduction 

Time series classification is an increasingly popular area of
research that provides solutions to a wide range of fields, including
data mining, image and motion recognition, signal processing,
environmental sciences, health care, and chemometrics. Over the last
decades, many time series representations, similarity measures, and
classification algorithms have been proposed, following the rapid
progress in data collection and storage technologies [1]. Nevertheless,
to date, the best overall performing classifier in the field is the
nearest-neighbor algorithm (1NN), which can be easily tuned for a
particular problem by the choice of a distance measure, an
approximation technique, or smoothing [?]. As pointed out in dozens of
papers, a simple "lazy" nearest-neighbor classifier is accurate and
robust, depends on very few parameters, and requires no training [?].
However, while possessing these qualities, the 1NN technique has a
number of significant disadvantages. Its major shortcoming is that it
does not offer any insight into the classification results. Another
limitation is that, in order to achieve good accuracy, it needs a
training set large enough to represent the class variance. Finally,
while its initialization is trivial, 1NN classification is
computationally expensive. Thus, the demand for a simple, efficient,
and interpretable classification technique capable of processing
large data collections remains.

In this work, we address the limitations outlined above by proposing
an alternative to the 1NN algorithm that provides superior
interpretability, learns efficiently from a small training set, and
has a low computational complexity in classification.

The paper is structured as follows. Section II discusses relevant
prior work and existing algorithms. Section III provides background
for the proposed algorithm. In Section IV we describe our algorithm,
and in Section V we evaluate its performance. Finally, we form our
conclusions and discuss future work in the closing section.

II.  Prior  and  related  work 

Almost all of the existing techniques for time series classification
can be divided into two major categories [?]. The first category is
based on shape-based similarity metrics, where distance is measured
directly between time series points. Classical examples of methods
from this category are nearest-neighbor classifiers built upon
Euclidean distance [?] or SpADe [?]. The second category consists of
classification techniques based on structural similarity metrics,
which employ high-level representations of time series based on their
global or local features. Examples from this category include a
classifier based on the Discrete Fourier Transform [?] and a
classifier based on the Bag-Of-Patterns (BOP) representation [?]. The
development of these distinct categories can be explained by
differences in their performance: while shape-based similarity
methods are virtually unbeatable on short, often pre-processed time
series data [?], they usually fail on long and noisy data sets [?],
where structure-based solutions demonstrate superior performance.

As possible alternatives to these two categories, two techniques
relevant to our work were recently proposed. The first is the time
series shapelets algorithm, which was introduced in [?] and features
superior interpretability and compactness of the delivered solution.
A shapelet is a short time series "snippet" that is representative of
class membership and is used for decision tree construction,
facilitating class identification and interpretability [?]. In order
to find a branching shapelet, the algorithm exhaustively searches for
the best discriminatory shapelet on a data split via an information
gain measure. The algorithm's classification is built upon the
similarity measure between a branching shapelet and a full time
series, defined as the distance between the shapelet and the closest
subsequence in the series when measured by the normalized Euclidean
distance. This technique potentially combines the superior precision
of exact shape-based similarity methods with the high-throughput
classification capacity and efficiency of approximate feature-based
techniques. However, while demonstrating superior interpretability,
robustness, and performance similar to that of kNN algorithms,
shapelet-based algorithms are computationally expensive (O(n^2 m^3),
where n is the number of objects and m is the length of the longest
time series), which makes their adoption difficult for many-class
classification problems [?]. While a better solution was recently
proposed (O(nm^2)), it is an approximate algorithm based on iSAX
approximation and indexing [?].

The second approach relevant to our work is the 1NN classifier built
upon the Bag-Of-Patterns (BOP) representation of time series [?]. The
BOP representation of a time series is equated to the Information
Retrieval (IR) "bag of words" concept, and is obtained by extraction,
symbolic approximation with SAX, and counting of the occurrence
frequencies of short overlapping subsequences (patterns) along the
time series. By applying this procedure to a training set, the
algorithm converts the data into the vector space, where each of the
original time series is represented by a pattern (SAX word)
occurrence frequency vector. These vectors are classified with a 1NN
classifier built upon Euclidean distance or Cosine similarity, on raw
frequencies or with tf*idf ranking. The authors showed that BOP has
several advantages: it has linear complexity (O(nm)), it is
rotation-invariant and considers local and global structures
simultaneously, and it provides insight into the distribution of
patterns through frequency histograms. Through an experimental
evaluation the authors concluded that the best classification
accuracy for BOP-represented time series is achieved by using a 1NN
classifier based on Euclidean distance between frequency vectors.

Our proposed algorithm has similarities with the aforementioned
techniques. Similarly to the shapelet-based approach, it finds time
series subsequences that are characteristic representatives of a
whole class, thus enabling superior interpretability. However,
instead of a recursive search for discriminating shapelets, our
algorithm ranks all potential candidate subsequences by importance at
once, with a linear computational complexity of O(nm). To achieve
this, similarly to BOP, SAX-VSM converts all of the training time
series into the vector space and computes tf*idf ranking. But instead
of building n bags (one for each of the training time series), our
algorithm builds a single bag of words for each of the classes, which
effectively provides a compact solution of N weight vectors (N is the
number of classes, N ≪ n) and a fast classification time of O(m).

As we shall show, these distinct features, the generalization of the
class patterns with a single bag and the tf*idf ranking, allow
SAX-VSM to achieve high accuracy and to tolerate noise in the data.

III.  Background 

SAX-VSM is based on two well-known techniques. The first technique is
Symbolic Aggregate approximation [?], which is a high-level symbolic
representation of time series data. The second technique is the
Vector Space Model [?], which is well known in Information Retrieval
(IR). By utilizing sliding window subsequence extraction and SAX, our
algorithm transforms labeled time series into collections of SAX
words (terms). At the following step, it utilizes tf*idf term
weighting for classifier construction. SAX-VSM classification relies
on the cosine similarity metric.

The SAX algorithm, however, requires two parameters to be provided as
input, and, to the best of our knowledge, no efficient solution for
selecting these parameters is known to date. To solve this problem,
we employ a global optimization scheme based on the dividing
rectangles (DIRECT) algorithm, which does not require any parameters
[?]. DIRECT is a derivative-free optimization process that possesses
local and global optimization properties. It converges relatively
quickly and yields a deterministic, optimized solution.

A.  Symbolic  Aggregate  approximation  (SAX) 

Symbolic representation of time series, once introduced [?], has
attracted much attention by enabling the application of numerous
string-processing algorithms, bioinformatics, and text mining tools
to temporal data. The method provides a significant reduction of the
time series dimensionality and a lower bound on the Euclidean
distance metric, which guarantees no false dismissals [?]. These
properties are often leveraged by other techniques, which embed the
SAX representation in their algorithms for indexing and
approximation. For example, the adoption of SAX indexing allowed a
significant speed improvement of shapelet discovery in
Fast-Shapelets [?] (but made the algorithm approximate).

Configured by two parameters, a desired word size w and an alphabet
size A, SAX produces a symbolic approximation of a time series T of
length n by compressing it into a string of length w (usually
w ≪ n), whose letters are taken from the alphabet α (|α| = A). At the
first step of the algorithm, T is z-normalized (to unit standard
deviation) [?]. At the second step, the dimensionality of the
normalized time series is reduced from n to w by obtaining its
Piecewise Aggregate Approximation (PAA) [20]: the normalized time
series is divided into w equal-sized segments, and the mean value of
the points within each segment is computed. The aggregated sequence
of these mean values forms the PAA approximation of T. Finally, each
of the w PAA coefficients is converted into a letter of the alphabet
α by the use of a lookup table. This table is pre-built by defining a
set of breakpoints that divide the normalized time series
distribution space into A equiprobable regions. The design of these
tables rests on the assumption that normalized series tend to have a
Gaussian distribution [?].
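
The three SAX steps described above (z-normalization, PAA, and breakpoint lookup) can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function names are hypothetical, the population standard deviation is assumed for z-normalization, the series length is assumed to be divisible by w, and the breakpoints are taken from the standard normal quantile function.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def znorm(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # flat input: nothing to rescale
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Mean of each of w equal-sized segments (assumes len(series) % w == 0)."""
    seg = len(series) // w
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(w)]


def breakpoints(a):
    """The a - 1 cut points that split N(0, 1) into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]


def sax_word(series, w, a):
    """Convert a series into a SAX word of length w over an alphabet of size a."""
    cuts = breakpoints(a)
    return "".join(ALPHABET[bisect_right(cuts, v)] for v in paa(znorm(series), w))
```

For a monotonically increasing series of eight points, `sax_word([1, 2, 3, 4, 5, 6, 7, 8], 4, 4)` walks through the four regions in order and yields the word `abcd`.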

B.  Bag  of  words  representation  of  time  series 

Following its introduction, SAX was shown to be an efficient tool for
solving the problems of finding motifs and discords in time series
[?]. The authors employed a sliding window-based subsequence
extraction technique and augmented data structures (a hash table in
[22] and a trie in [?]) in order to build SAX word "vocabularies".
Further, by analyzing word frequencies and locations, they were able
to capture frequent and rare SAX words representing motif and discord
subsequences. Later, the same technique based on the combination of a
sliding window and SAX was used in numerous works, most notably in
time series classification using bag of patterns [?].

We also use this sliding window technique to convert a time series T
of length n into a set of m SAX words, where m = (n - L) + 1 and L is
the sliding window length. By sliding a window of length L across the
time series T, extracting subsequences, converting them to SAX words,
and placing these words into an unordered collection, we obtain the
bag of words representation of the original time series T.
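
A minimal sketch of this sliding window procedure is given below. The function names are hypothetical, and to keep the example self-contained the subsequence-to-word conversion is passed in as a callable; `up_down_word` is only a toy stand-in for SAX, not part of the method.

```python
from collections import Counter


def bag_of_words(series, window, to_word):
    """Slide a window of the given length over the series and count the words.

    A series of length n yields m = (n - window) + 1 subsequences.
    """
    bag = Counter()
    for start in range(len(series) - window + 1):
        bag[to_word(series[start:start + window])] += 1
    return bag


def up_down_word(subsequence):
    """Toy stand-in for SAX: 'u' for a rising step, 'd' for any other step."""
    return "".join("u" if b > a else "d"
                   for a, b in zip(subsequence, subsequence[1:]))


series = [1, 2, 3, 2, 1, 2, 3]          # n = 7
bag = bag_of_words(series, 3, up_down_word)  # L = 3, so m = 5 words
```

Here the five extracted windows produce the words `uu`, `ud`, `dd`, `du`, and `uu`, so the bag holds `uu` twice and each of the other words once; the order in which the words were observed is discarded.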


C.  Vector  Space  Model  (VSM)  adaptation 


We use the Vector Space Model exactly as it is known in information
retrieval (IR) [?]. Similarly to IR, we define and use the terms
document, bag of words, corpus, and sparse matrix in our workflow.
Note, however, that we use the terms bag of words and document
interchangeably to denote an unordered collection of SAX words,
while in IR these usually bear different meanings, since a document
usually presumes a certain ordering of words (semantics). Although
similar definitions, such as bag of features or bag of patterns, were
previously proposed for techniques built upon SAX [8], we use bag of
words since it reflects our workflow precisely. The term corpus is
used for a structured collection of bags of words.

Given a training set, SAX-VSM builds bags of SAX-generated words
representing each of the training classes and assembles them into a
corpus. This corpus, by its construction, is a sparse term frequency
matrix. The rows of this matrix correspond to the set of all SAX
words found in all classes, while each column of the matrix denotes a
class of the training set. Each element of this matrix is the
observed frequency of a word in a class. Many elements of this matrix
are zeros, because words extracted from one class are often not found
in others (Figure 4). By its design, this sparse term frequency
matrix is a dictionary of all SAX words extracted from all time
series of a training set, which accounts for the frequency of each
word in each of the training classes.

Following the common IR workflow, we apply the tf*idf weighting
scheme to each element of this matrix in order to transform a
frequency value into a weight coefficient. The tf*idf weight for a
term is defined as the product of two factors: term frequency (tf)
and inverse document frequency (idf). For the first factor, we use
logarithmically scaled term frequency [23]:

$$\mathrm{tf}_{t,d} = \begin{cases} \log(1 + f_{t,d}), & \text{if } f_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$


where t is the term, d is a bag of words (a document), and
$f_{t,d}$ is the frequency of the term in the bag.

The inverse document frequency is computed as usual:

$$\mathrm{idf}_{t,D} = \log_{10} \frac{|D|}{|\{d \in D : t \in d\}|} = \log_{10} \frac{N}{\mathrm{df}_t} \tag{2}$$

where N is the cardinality of the corpus D (the total number of
classes) and the denominator $\mathrm{df}_t$ is the number of
documents in which the term t appears.

Fig. 1. An overview of the SAX-VSM algorithm: first, labeled time series
are converted into bags of words using SAX; second, tf*idf statistics
are computed, resulting in a single weight vector per training class. For
classification, an unlabeled time series is converted into a term frequency
vector and assigned the label of the weight vector that yields the maximal
cosine similarity value.

This weighting scheme is ltc.nnn in SMART notation [23].

Then, the tf*idf value for a term t in the document d of a corpus D
is defined as

$$\mathrm{tf{*}idf}(t, d, D) = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t,D} = \log(1 + f_{t,d}) \times \log_{10} \frac{N}{\mathrm{df}_t} \tag{3}$$

for all cases where $f_{t,d} > 0$ and $\mathrm{df}_t > 0$, or zero
otherwise. Once all terms of a corpus are weighted, the columns of
the sparse matrix are used as class term-weight vectors that
facilitate classification using cosine similarity.
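
The weighting of Equations (1) to (3) can be sketched as follows. This is an illustrative sketch only: the function name and the dictionary-based representation of the sparse matrix are assumptions, and the natural logarithm is assumed for the term frequency factor, since the base is not stated in Equation (1).

```python
import math


def tfidf_weights(class_bags):
    """Turn per-class word counts into per-class tf*idf weight vectors.

    class_bags maps a class label to a dict of {word: raw frequency}.
    The result has the same shape, with frequencies replaced by weights.
    """
    n_classes = len(class_bags)

    # Document frequency: in how many class bags each word appears.
    df = {}
    for bag in class_bags.values():
        for word, freq in bag.items():
            if freq > 0:
                df[word] = df.get(word, 0) + 1

    weights = {}
    for label, bag in class_bags.items():
        weights[label] = {
            word: math.log(1 + freq) * math.log10(n_classes / df[word])
            for word, freq in bag.items()
            if freq > 0
        }
    return weights


bags = {
    "A": {"abc": 3, "abd": 1},
    "B": {"abc": 2, "bcd": 4},
}
w = tfidf_weights(bags)
```

In this toy corpus the word `abc` occurs in both classes, so its idf factor is log10(2/2) = 0 and it receives zero weight in both vectors, while `abd` and `bcd`, each unique to one class, keep positive weights.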

The cosine similarity measure between two vectors is based on their
inner product. For two vectors a and b it is:

$$\mathrm{similarity}(a, b) = \cos(\theta) = \frac{a \cdot b}{\|a\| \cdot \|b\|} \tag{4}$$


IV.  SAX-VSM  CLASSIFICATION  ALGORITHM 


Like many other classification techniques, SAX-VSM consists of two
parts: the training phase and the classification procedure.


A.  Training  phase 

First, the algorithm transforms all labeled time series into a
symbolic representation. For this, it converts the time series into a
SAX representation configured by four parameters: the sliding window
length (W), the number of PAA frames per window (P), the SAX alphabet
size (A), and the numerosity reduction strategy (S) (we shall discuss
the choice of these parameters later). Each of the subsequences
extracted with the overlapping sliding window is normalized to unit
standard deviation before being processed with PAA [?]. If, however,
the standard deviation value falls below a fixed threshold, the
normalization procedure is not applied, in order to avoid a possible
over-amplification of background noise.
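
The thresholded normalization described above can be sketched as follows. The function name and the threshold value of 0.01 are assumptions made only for illustration (the text does not state the value used), and the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def znorm_with_threshold(subsequence, threshold=0.01):
    """Z-normalize a subsequence unless it is nearly flat.

    When the standard deviation falls below the threshold, the values are
    returned unchanged so that background noise is not amplified.
    """
    sigma = pstdev(subsequence)
    if sigma < threshold:
        return list(subsequence)
    mu = mean(subsequence)
    return [(x - mu) / sigma for x in subsequence]


flat = znorm_with_threshold([5.0, 5.001, 5.0, 4.999])  # left as is
varied = znorm_with_threshold([0.0, 2.0])              # rescaled to -1 and 1
```

Without the threshold, the nearly flat subsequence in the first call would be stretched to unit standard deviation, turning fluctuations of one thousandth of the signal level into apparent structure.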

By applying this conversion procedure to all time series from N
training classes, the algorithm builds a corpus of N bags, to which,
in turn, it applies tf*idf ranking. These steps result in N
real-valued weight vectors of equal length representing the N
training classes.

As shown, because of the need to scan the whole training set,
training of the SAX-VSM classifier is computationally expensive
(O(nm)). However, there is no need to maintain an index of training
series, or to keep any of them in memory at runtime: the algorithm
simply iterates over all training time series, incrementally building
a single bag of SAX words for each of the training classes. Once
built and processed with tf*idf, the corpus is also discarded; only
the resulting set of N real-valued weight vectors is retained for
classification.


B.  Classification  phase 

In order to classify an unlabeled time series, SAX-VSM transforms it
into a term frequency vector using exactly the same sliding window
technique and SAX parameters that were used in the training phase.
Then, it computes cosine similarity values between this term
frequency vector and the N tf*idf weight vectors representing the
training classes. The unlabeled time series is assigned to the class
whose vector yields the maximal cosine similarity value.
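
The classification step can be sketched as follows, assuming sparse vectors stored as dictionaries keyed by SAX word. The function names, class labels, and weight values are hypothetical and serve only to illustrate the cosine-based label assignment.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def classify(term_frequencies, class_weights):
    """Return the label whose weight vector is most similar to the query."""
    return max(class_weights,
               key=lambda label: cosine(term_frequencies, class_weights[label]))


# Hypothetical tf*idf weight vectors for two classes.
class_weights = {
    "cylinder": {"aab": 0.5, "abb": 0.2},
    "bell": {"bba": 0.7, "abb": 0.1},
}
query = {"bba": 3, "abb": 1}  # raw term frequencies of an unlabeled series
label = classify(query, class_weights)
```

The query shares its dominant word `bba` only with the `bell` vector, so that class yields the larger cosine value and its label is assigned. Only N similarity computations are needed per query, regardless of the training set size.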

C.  Sliding  window  size  and  SAX  parameters  selection 

In its current state of development, the SAX-VSM classification
algorithm requires a sliding window size and SAX parameters to be
specified upfront. Currently, in order to select optimal parameter
values knowing only the training data set, we use a common
cross-validation scheme and the DIRECT (DIviding RECTangles)
algorithm, which was introduced in [?]. The DIRECT optimization
algorithm is designed to search for global minima of a real-valued
function over a bound-constrained domain; since our parameters are
integers, we round the reported solution values to the nearest
integer.
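
One way to connect a continuous optimizer such as DIRECT to integer-valued parameters is to wrap the cross-validation error in an objective that rounds each coordinate before evaluation. The sketch below illustrates only this rounding idea and is not the authors' implementation: the function names are hypothetical, the caching is an added convenience, and `toy_error` is a stand-in for a real cross-validation run.

```python
def make_integer_objective(cv_error):
    """Wrap a cross-validation error function for a continuous optimizer.

    The optimizer proposes real-valued points; each coordinate is rounded to
    the nearest integer, and results are cached so that points falling into
    the same integer cell do not trigger a repeated evaluation.
    """
    cache = {}

    def objective(point):
        params = tuple(int(round(x)) for x in point)
        if params not in cache:
            cache[params] = cv_error(*params)
        return cache[params]

    return objective


calls = []


def toy_error(window, paa_size, alphabet_size):
    """Stand-in for a cross-validation run; records each evaluation."""
    calls.append((window, paa_size, alphabet_size))
    return (abs(window - 30) + abs(paa_size - 6) + abs(alphabet_size - 5)) / 100


objective = make_integer_objective(toy_error)
first = objective([29.6, 6.2, 4.9])   # rounds to (30, 6, 5)
second = objective([30.4, 5.8, 5.1])  # same integer cell, served from cache
```

Because cross-validation is the expensive part of parameter selection, avoiding repeated evaluations of the same integer triple can matter more than the cost of the optimizer itself.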

The DIRECT algorithm iteratively performs two procedures:
partitioning the search domain, and identifying potentially optimal
hyper-rectangles (i.e., those with the potential to contain good
solutions). It begins by scaling the search domain to an
n-dimensional unit hypercube, which is considered potentially
optimal. The error function is then evaluated at the center of this
hypercube. Next, other points are created at one-third of the
distance from the center in all coordinate directions. The hypercube
is then divided into smaller rectangles that are identified by their
center point and their error function value. This procedure continues
iteratively until the error function converges. For brevity, we omit
the detailed explanation of the algorithm and refer the interested
reader to [?] for additional details. Figure 2 illustrates the
application of DIRECT to the SyntheticControl data set problem.

D.  Intuition  behind  SAX-VSM 

First of all, by combining all SAX words extracted from all time
series of a single class into a single bag of words, SAX-VSM manages
not only to capture the observed intraclass variability, but also to
efficiently "generalize" it through smoothing with PAA and SAX.

Secondly, by partially discarding the original ordering of time
series subsequences and through subsequence normalization, SAX-VSM is
able to capture and recognize characteristic subsequences in time
series distorted by rotation or shift, as well as to recover a signal
that is partially corrupted or altered by noise.

Fig. 2. Parameters optimization with DIRECT for the SyntheticControl data
set (6 classes). The left panel shows all points sampled by DIRECT in the
PAA × Window × Alphabet space, where red points correspond to high error
values in cross-validation experiments and green points indicate low error
values. Note the concentration of green points at W=42. The middle panel
shows an error-rate heat map when the sliding window size is fixed to 42;
this figure was obtained by a complete scan of all 432 points of the slice.
The right panel shows the sampling optimized by DIRECT. The optimal
solution (W=42, P=?, A=4) was found by sampling 43 points.

Thirdly, the tf*idf statistics naturally "highlight" terms unique to
a class by assigning them higher weights, while terms observed in
multiple classes are assigned weights inversely proportional to their
interclass presence frequency. This weighting scheme improves the
selectivity of classification by lowering the contribution of
"confusing" multi-class terms while increasing the contribution of a
class's "defining" terms to the final similarity value.

When combined, these features make the SAX-VSM time series
classification approach unique. Ultimately, the algorithm compares a
set of subsequences extracted from an unlabeled time series with a
weighted set of all characteristic subsequences representing a whole
training class. Thus, an unknown time series is classified by its
similarity not to a given number of "neighbors" (as in kNN or BOP
classifiers), or to a pre-fixed number of characteristic features (as
in shapelet-based classifiers), but by its combined similarity to all
known discriminative subsequences found in a whole class during
training.

This, as we shall show, contributes to excellent classification
performance on temporal data sets where time series have a very low
intraclass similarity at full length, but embed subsequences that are
characteristic of the class.

V.  Results 

We  have  proposed  a  novel  algorithm  for  time  series  classifi¬ 
cation  based  on  SAX  approximation  of  time  series  and  Vector 
Space  Model  called  SAX-VSM.  Here,  we  present  a  range 
of  experiments  assessing  its  performance  in  classification 
and  clustering  and  show  its  ability  to  provide  insight  into 
classification  results. 

A.  Analysis  of  the  classification  accuracy 

To evaluate our approach, we selected thirty-three data sets. The
majority of the data sets were taken from the UCR time series
repository [?]; the Ford data set was downloaded from the IEEE World
Congress on Computational Intelligence website [28], and the
ElectricDevices data set was downloaded from the supporting website
for [12]. Overall, SAX-VSM classification performance was found to be
at the level of 1NN classifiers based on Euclidean distance, DTW, or
BOP, and of a shapelet tree. This result is not surprising taking
into account the "No Free Lunch theorems" [29], which assert that
there will not be a single dominant classifier for all time series
classification (TSC) problems.

Table I
Classifiers error rates comparison.

Data set       Nb. of classes  1NN-Euclidean  1NN-DTW  Fast Shapelet Tree  Bag Of Patterns  SAX-VSM
Adiac          37              0.389          0.396    0.515               0.432            0.381
Beef           5               0.467          0.467    0.447               0.400            0.033
CBF            3               0.148          0.003    0.053               0.013            0.002
Coffee         2               0.250          0.180    0.067               0.036            0.0
ECG200         2               0.120          0.230    0.227               0.140            0.140
FaceAll        14              0.286          0.192    0.402               0.219            0.207
FaceFour       4               0.216          0.170    0.089               0.011            0.0
Fish           7               0.217          0.167    0.197               0.074            0.017
Gun-Point      2               0.087          0.093    0.060               0.002            0.007
Lightning2     2               0.246          0.131    0.295               0.164            0.196
Lightning7     7               0.425          0.274    0.403               0.466            0.301
Olive Oil      4               0.133          0.133    0.213               0.133            0.100
OSU Leaf       6               0.483          0.409    0.359               0.236            0.107
Syn. Control   6               0.120          0.007    0.081               0.037            0.010
Swed. Leaf     15              0.213          0.210    0.270               0.198            0.251
Trace          4               0.240          0.0      0.002               0.0              0.0
Two patterns   4               0.090          0.0      0.113               0.129            0.004
Wafer          2               0.005          0.020    0.004               0.003            0.0006
Yoga           2               0.170          0.164    0.249               0.170            0.164

Table I compares the performance of SAX-VSM and four competing
classifiers: two state-of-the-art 1NN classifiers based on Euclidean
distance and DTW, the classifier based on the recently proposed
Fast-Shapelets technique [?], and the classifier based on BOP [?]. We
selected these particular techniques in order to position SAX-VSM in
terms of accuracy and interpretability. The selection of data sets
presented in the comparison is limited to those for which benchmark
results, previously published or provided by the authors, exist for
all four competing classifiers. The performance of SAX-VSM on the
rest of the data sets will be made available online along with our
reference implementation if accepted.

In our evaluation, we followed the train/test split of the data
(exactly as provided by UCR or other sources). We used the train data
exclusively in cross-validation experiments for the selection of SAX
parameters and the numerosity reduction strategy using our DIRECT
implementation. Once selected, the optimal set of parameters was used
to assess the SAX-VSM classification accuracy, which is reported in
the last column of Table I.

B.  Scalability  analysis 

For synthetic data sets, it is possible to create as many instances
as one needs for experimentation. We used CBF [?] in order to
investigate and compare the performance of SAX-VSM and the 1NN
Euclidean classifier on increasingly large data sets.

In one series of experiments, we varied the training set size from
ten to one thousand, while the test data set size remained fixed at
ten thousand instances. For small training data sets, SAX-VSM was
found to be significantly more accurate than the 1NN Euclidean
classifier. However, by the time we had more than 500 time series in
our training set, there was no statistically significant difference
in accuracy (Fig. 3, left). As for the running time cost, due to the
comprehensive training, SAX-VSM was found to be more expensive than
the 1NN Euclidean classifier on small training sets, but outperformed
1NN on large training sets. However, SAX-VSM allows training to be
performed offline, with the tf*idf weight vectors loaded when needed.
If this option can be utilized, our method performs classification
significantly faster than the 1NN Euclidean classifier (Fig. 3,
right).

Fig. 3. Comparison of classification precision and run time of SAX-VSM
and the 1NN Euclidean classifier on CBF data. SAX-VSM performs
significantly better with a limited amount of training samples (left
panel). While SAX-VSM is faster in time series classification, its
performance is comparable to that of the 1NN Euclidean classifier when
training time is accounted for (right panel).

In another series of experiments, we investigated the scalability of
our algorithm with unrealistic training set sizes: up to one million
instances of each of the CBF classes. As expected, with the growth of
the training set size, the curve for the total number of distinct SAX
words and the curves for the dictionary sizes of each of the CBF
classes showed significant saturation (Fig. 4, left). For the largest
of the training sets, one million instances of each class, the size
of the dictionary peaked at 67,324 distinct words (which is less than
10% of all possible words of length 7 from an alphabet of 7 letters),
and the longest tf*idf vector accounted for 23,569 values (Fig. 4,
right). In our opinion, this result reflects two specificities. The
first is that the diversity of words that can be encountered in the
CBF data set is quite limited by the configuration of its classes and
by our choice of SAX parameters (smoothing). The second is that IDF
(inverse document frequency, Equation (2)) efficiently limits the
growth of the dictionaries by eliminating those words that are
observed in all of them.
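
As a quick check of the "less than 10%" figure quoted above, taking the stated word length w = 7 and alphabet size A = 7, the number of possible words and the fraction actually observed are:

$$A^{w} = 7^{7} = 823\,543, \qquad \frac{67\,324}{823\,543} \approx 0.082$$

That is, roughly 8.2% of all possible words were observed even with one million training instances per class.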


Fig. 4. Left panel ("Terms count evolution for CBF classes"): evolution
of dictionary sizes for CBF with increasingly large training set sizes.
Right panel ("Terms counts and distribution for 1M of CBF series"):
distribution of SAX terms in the CBF corpus for a training set of one
million series of each class.

Fig. 5. Classification performance with added noise (left panel; the
random noise level varies up to 100% of the signal value) and with signal
loss (right panel). SAX-VSM Opt curves correspond to results obtained with
SAX parameters optimized for each case (we re-trained the classifier).

C.  Robustness  to  noise 

In our experimentation with many data sets, we observed that
the dimensionality of tf*idf weight vectors grows continuously
with the training set size, which indicates that SAX-VSM is
actively learning from class variability. This observation,
and the fact that the weight of each of the overlapping SAX
words contributes only a small fraction to the final similarity
value, prompted the idea that the SAX-VSM classifier might be
robust to noise and to the partial loss of a signal in test
time series. Intuitively, in such a case, the cosine similarity
between high-dimensional weight vectors might not degrade
enough to cause a misclassification.
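
This intuition can be illustrated with a deliberately simplified sketch: a high-dimensional weight vector is compared with a copy of itself in which a random 20% of the entries are zeroed, imitating the SAX words lost to noise. The vector size, the uniform random weights, and the 20% loss rate are all assumptions chosen for illustration; no real SAX-VSM weights are involved.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

random.seed(0)
n = 1000
weights = [random.random() for _ in range(n)]

# Zero out 20% of the entries to imitate words that no longer appear
# in a noisy or partially lost test series.
lost = set(random.sample(range(n), n // 5))
degraded = [0.0 if i in lost else w for i, w in enumerate(weights)]

similarity = cosine(weights, degraded)
print(round(similarity, 3))
```

In this setting the similarity equals the square root of the share of squared weight that survives, so losing a fifth of the words lowers it only to roughly 0.9.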

While we plan to perform more exploration, current
experimentation with the CBF data set revealed promising
results. In one series of experiments, fixing the training set
size at two hundred fifty time series, we varied the standard
deviation of the Gaussian noise in the CBF model (whose default
value is about 17% of the signal level). We found that SAX-VSM
increasingly outperformed the 1NN Euclidean classifier as the
noise level grew (Fig. 5, left). Further improvement of SAX-VSM
performance was achieved by fine-tuning the smoothing, through
a gradual increase of the SAX sliding window size in proportion
to the growth of the noise level (Fig. 5, left, SAX-VSM Opt
curve).

In another series of experiments, we randomly replaced up to
fifty percent of the span of an unlabeled time series with
random noise. Again, SAX-VSM performed consistently better than
the 1NN Euclidean classifier regardless of the training set
size, which we varied from five to one thousand. The SAX-VSM
Opt curve in Fig. 5 (right) depicts the case with fifty
training series, when the sliding window size was decreased in
inverse proportion to the growth of the signal loss.
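
One possible way to simulate such a signal loss is sketched below. The details are assumptions and not a description of our experimental code: the lost span is taken to be contiguous, its position is drawn uniformly, and the replacement noise is uniform over the value range of the series. The function name `corrupt_span` is hypothetical.

```python
import random

def corrupt_span(series, loss_fraction, rng):
    """Replace a random contiguous span of the series with uniform noise
    drawn from the value range of the series itself."""
    n = len(series)
    span = int(n * loss_fraction)
    start = rng.randrange(n - span + 1)
    lo, hi = min(series), max(series)
    noisy = list(series)
    for i in range(start, start + span):
        noisy[i] = rng.uniform(lo, hi)
    return noisy, start, span

rng = random.Random(42)
series = [float(i % 16) for i in range(128)]
noisy, start, span = corrupt_span(series, 0.5, rng)
print(start, span)
```

The corrupted series would then be converted to SAX words and classified in the usual way, so that only the words overlapping the lost span are affected.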

D.  Interpretable  classification 

While the classification performance results in the previous
sections show that the SAX-VSM classifier has very good
potential, its major strength is the level of interpretability
it allows for classification results.

Previously, in the original shapelets work uni, ini, it was
shown that the resulting decision trees provide interpretable
classification and offer insight into data-specific features.
In successive work based on shapelets na, it was shown that
the discovery of multiple shapelets provides even better
resolution and intuition into the interpretability of
classification. However, as the authors noted, the time cost of
discovering multiple shapelets in many-class problems could be
very significant. In contrast, SAX-VSM extracts and weights all
patterns at once, without any added cost. Thus, it could be the
only choice for interpretable classification in many-class
problems.

Fig. 6. An example of the heatmap-like visualization of
subsequence "importance" to a class identification. Here, for
three CBF time series from a training set, the color value of
each point was obtained by combining the tf*idf weights of all
patterns which cover the point. If a pattern was found in the
SAX-VSM-built dictionary corresponding to the time series
class, we added its weight; if, however, a pattern was found in
another dictionary, we subtracted its weight. The features
highlighted by the visualization, corresponding to a sudden
rise, plateau, and sudden drop in Cylinder; an increasing trend
in Bell; and a sudden rise followed by a gradual drop in
Funnel, align exactly with the design of these classes.

1) Heatmap-like visualization: Since SAX-VSM builds tf*idf
weight vectors using all subsequences extracted from a training
set, it is possible to find the weight of any arbitrarily
selected subsequence. This feature enables a novel
visualization technique that can be used to gain immediate
insight into the layout of "important" class-characterizing
subsequences, as shown in Figure 6.
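
The per-point combination of weights used for this visualization can be sketched as follows. The sketch assumes that the series has already been discretized into one SAX word per sliding-window start position and that the class weight vectors are stored as dictionaries; the function name `heatmap` and the toy words and weights are hypothetical.

```python
def heatmap(words_by_position, window, length, own_class, class_vectors):
    """Per-point importance: sum of the signed tf*idf weights of all
    sliding windows that cover the point."""
    heat = [0.0] * length
    for start, word in enumerate(words_by_position):
        score = 0.0
        for label, vector in class_vectors.items():
            weight = vector.get(word, 0.0)
            # Add weights from the series' own class, subtract all others.
            score += weight if label == own_class else -weight
        for i in range(start, min(start + window, length)):
            heat[i] += score
    return heat

class_vectors = {"Cylinder": {"abc": 1.0}, "Bell": {"bcd": 0.5}}
# One SAX word per window start (window of 3 points, series of 5 points).
words = ["abc", "bcd", "zzz"]
heat = heatmap(words, 3, 5, "Cylinder", class_vectors)
print(heat)
```

Mapping the resulting values to a color scale gives the heatmap; positive values mark points covered by patterns characteristic of the series' own class.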

2) Gun Point data set: Following the previously mentioned
shapelet-based work Da, Da, we used the well-studied Gun-Point
data set ED to explore the interpretability of classification
results. This data set contains two classes. Time series in the
Gun class correspond to the actors' hand motion when drawing a
replicate gun from a hip-mounted holster, pointing it at a
target for a second, and returning the gun to the holster. Time
series in the Point class correspond to the actors' hand motion
when pretending to draw a gun: the actors point their index
fingers at a target for about a second, and then return their
hands to their sides.

Fig. 7. (Panel titles: "Gun time series annotation", "Point
time series annotation", "Best pattern, Gun", "Second best
pattern, Gun".) Best characteristic subsequences (right panels,
bold lines) discovered by SAX-VSM in the Gun/Point data set.
The left panel shows the actor's stills and time series
annotations made by an expert; the right panels show the
locations of characteristic subsequences. Note that while the
upward arm motion was found to be more "important" in the Gun
class (gun retrieval and aiming), the downward arm motion
better characterizes the Point class (an "overshoot" phenomenon
in the propless arm return). This result aligns with previous
work Qo) and (m. (Stills and annotation used with permission
from E. Keogh.)

Similarly to previously reported results Cl, m, SAX-VSM was
able to capture all distinguishing features, as shown in
Figure 7. The patterns weighted most heavily by SAX-VSM in the
Gun class correspond to the fine extra movements required to
lift and aim the prop. The most heavily weighted SAX pattern in
the Point class corresponds to the "overshoot" phenomenon,
which causes the dip in the time series. Also, similarly to the
original work ED, SAX-VSM highlighted as the second-best
pattern in the Point class the lack of the distinguishing
subtle extra movements required for lifting a hand above a
holster and reaching down for the gun.

3) OSU Leaf data set: According to the original data source,
Ashid Grandhi, with the current growth of digitized data, there
is a huge demand for automatic management and retrieval of
various images. The OSULeaf data set consists of curves
obtained by color image segmentation and boundary extraction
(in the anti-clockwise direction) from digitized leaf images of
six classes: Acer Circinatum, Acer Glabrum, Acer Macrophyllum,
Acer Negundo, Quercus Garryana, and Quercus Kelloggii. The
authors were able to solve the problem of classifying leaf
boundary curves by using DTW, achieving 61% classification
accuracy. However, as we pointed out above, DTW provided very
little information about why it succeeded or failed.

In contrast, the application of SAX-VSM yielded a set of
class-specific characteristic patterns for each of the six leaf
classes of the OSULeaf data set. These characteristic patterns
closely match known techniques of leaf classification based on
leaf shape and margin. The features highlighted by SAX-VSM
include the slightly lobed shape and acute tips of Acer
Circinatum leaves, the serrated blade of Acer Glabrum leaves,
the acuminate tip and characteristic serration of Acer
Macrophyllum leaves, the pinnately compound leaf arrangement of
Acer Negundo, the incised leaf margin of Quercus Kelloggii, and
the lobed leaf structure of Quercus Garryana. Figure 8 shows a
subset of these characteristic patterns and the original leaf
images with the corresponding features highlighted.

4) Coffee data set: Another illustration of interpretable
classification with SAX-VSM is based on the analysis of its
performance on the Coffee dataset [34]. The curves in this
dataset correspond to spectra obtained with diffuse reflection
infrared Fourier transform (DRIFT) and truncated to 286 data
points in the region 800-1900 cm⁻¹. The two subsequences ranked
highest by SAX-VSM in both classes correspond to the
spectrogram intervals of chlorogenic acid (best) and caffeine
(second best). These two chemical compounds are known to be
responsible for the flavor differences between Arabica and
Robusta coffees; moreover, these spectrogram intervals were
reported as discriminative when used in a PCA-based technique
by the authors of the original work [34].

Fig. 8. Best characteristic subsequences (top panels, bold
lines) discovered by SAX-VSM in the OSULeaf data set. These
patterns align with discrimination techniques well known in
botany, based on lobe shapes, serrations, and leaf tip types
f^.

Fig. 9. Best characteristic subsequences (left panels, bold
lines) discovered by SAX-VSM in the Coffee data set. Right
panels show a zoomed-in view of these subsequences in the
Arabica and Robusta spectrograms. These discriminative
subsequences correspond to the chlorogenic acid (best
subsequence) and caffeine (second best) regions of the spectra.
This result aligns exactly with the original work based on PCA
[34].


VI.  Clustering 

Clustering is a common tool used for data partitioning,
visualization, and exploration, and it serves as an important
subroutine in many data mining algorithms. Typically,
clustering algorithms are built upon a distance function, and
the overall performance of an algorithm is highly dependent on
the performance of the chosen function. Thus, an experimental
evaluation of the proposed technique in clustering provides an
additional perspective on its performance and applicability
beyond classification.

A.  Hierarchical  clustering 

Probably one of the most widely used clustering algorithms is
hierarchical clustering, which requires no parameters to be
specified llTSl. It computes pairwise distances between all
objects and produces a nested hierarchy of clusters, offering
great data visualization power.

Previously, it was shown that the bag-of-patterns time series
representation and Euclidean distance provide superior
clustering performance ||8l. For comparison, we performed
similar experiments which differ in the time series
representation and the distance metric: we relied on tf*idf
weight vectors and cosine similarity. Affirming the previous
work, we found that the combination of SAX and the Vector space
model outperforms classical shape-based distance metrics. For
example, Figure 10 depicts the result of hierarchical
clustering of a subset of the SyntheticControl data. As one can
see, SAX-VSM is superior in clustering performance to the
Euclidean and DTW distance metrics in this particular setup: it
produced a hierarchy which properly partitions the data set
into three branches.

Fig. 10. A comparison of hierarchical clustering applied to a
subset of three SyntheticControl classes: Normal, Decreasing
trend, and Upward shift. Euclidean distance, Dynamic time
warping, SAX-VSM, and Complete linkage were used to generate
these plots. Only SAX-VSM was able to partition the series
properly.
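
The clustering setup described above, complete linkage over cosine dissimilarity between weight vectors, can be sketched in a naive form. This is an illustration under stated assumptions and not the code used to produce the plots: each series is assumed to be already represented by a dictionary of SAX word weights, the merging stops at a requested number of clusters instead of building the full hierarchy, and the names `cosine_distance` and `complete_linkage` and the toy vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (na * nb)

def complete_linkage(vectors, num_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose farthest members are closest, until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(cosine_distance(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

vectors = [
    {"aab": 1.0, "abb": 0.2},
    {"aab": 0.9, "abb": 0.3},
    {"ccd": 1.0},
    {"ccd": 0.7, "cdd": 0.2},
]
clusters = complete_linkage(vectors, 2)
print(clusters)
```

Series that share no SAX words have a cosine dissimilarity of exactly one, so in this sketch groups with disjoint vocabularies are merged last.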

B.  k-Means  clustering 

Another popular choice for data partitioning is the k-Means
clustering algorithm ll^. The basic intuition behind this
algorithm is that through the iterative reassignment of objects
to different clusters the intra-cluster distance is minimized.
As was shown, the k-Means algorithm scales much better than
hierarchical partitioning techniques llJTll. Fortunately, this
clustering technique is well studied in the IR field.
Previously, in ifMl, the authors extensively examined seven
different criterion functions for partitional document
clustering and found that k-prototypes partitioning with cosine
dissimilarity delivers excellent performance.

Following this work, we implemented a spherical k-means
algorithm similar to 1^ and found that the algorithm converges
quickly and delivers a satisfactory partitioning on short
synthetic data sets. Further, we evaluated our technique on
long time series from the PhysioNet archive BOll. We extracted
two hundred fifty series corresponding to five vital signals:
two ECG leads (aVR and II), and RESP, PLETH, and CO2 waves,
trimming them to 2'048 points. Similarly to m, we ran a
reference k-Means implementation based on Euclidean distance,
which achieved a maximum clustering quality of 0.39, when
measured as proposed in ED on the best clustering (the one with
the smallest objective function in 10 runs). The SAX-VSM
spherical k-Means implementation outperformed the reference
technique, yielding clusters with a quality of 0.67 (on 10 runs
with SAX parameters set to W=33, P=8, A=6).
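
For reference, a generic spherical k-means can be sketched as follows. This is a textbook-style illustration and not our implementation: dense lists stand in for the sparse weight vectors, the initial centroids are sampled from the data, and the function names, iteration count, and toy data are assumptions.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster unit-normalized vectors by cosine similarity to unit-length centroids."""
    rng = random.Random(seed)
    data = [normalize(v) for v in vectors]
    centroids = [list(v) for v in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment: for unit vectors the dot product equals cosine similarity.
        labels = [max(range(k),
                      key=lambda c: sum(x * y for x, y in zip(v, centroids[c])))
                  for v in data]
        # Update: sum the members of each cluster and project the sum
        # back onto the unit sphere.
        for c in range(k):
            members = [v for v, label in zip(data, labels) if label == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels

vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = spherical_kmeans(vectors, 2)
print(labels)
```

The only differences from the Euclidean k-Means are the normalization of the data and centroids and the use of the dot product in the assignment step.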


VII. Conclusion and Future Work

In this paper, we have proposed a novel interpretable technique
for time series classification based on characteristic pattern
discovery. We have shown that our approach is competitive with,
or superior to, other techniques on a variety of classic data
mining problems. In addition, we described several advantages
of SAX-VSM over existing structure-based similarity measures,
emphasizing its capacity to discover and rank short
subsequences by their class characterization power.

The current limitations of our SAX-VSM implementation suggest a
number of future work directions. First of all, while the
Vector space model naturally supports processing of bags of
words composed of terms of variable length, our current
"stable" implementation lacks this capacity. Inspired by the
recently reported superior performance of multi-shapelet-based
classifiers m, we prioritize this development. Secondly, as
mentioned before, DIRECT optimization is designed for a
function of a real variable. When using rounding in our
implementation, we have observed DIRECT iteratively sampling
redundant locations in a suboptimal neighborhood; thus, a more
appropriate optimization scheme is needed. Finally, we are
designing and experimenting with an extension of SAX-VSM to
multidimensional time series. Currently we are evaluating two
candidate implementations: the first is based on a single bag
of words accommodating all dimensions for a class (by prefixing
SAX words extracted from different dimensions), while the
second is based on the use of a separate bag of words for each
dimension. The preliminary results on synthetic data sets look
promising, and we expect to report our findings soon.
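
The two candidate representations for multidimensional series can be sketched as follows. The sketch only illustrates the bookkeeping and is not one of the implementations under evaluation; the dimension names, the `dim:word` prefix format, and the function names are assumptions.

```python
from collections import Counter

def single_bag(words_per_dimension):
    """Variant 1: one bag for all dimensions; every SAX word is prefixed
    with the name of the dimension it was extracted from."""
    bag = Counter()
    for dim, words in words_per_dimension.items():
        bag.update(f"{dim}:{word}" for word in words)
    return bag

def bag_per_dimension(words_per_dimension):
    """Variant 2: a separate bag of words for every dimension."""
    return {dim: Counter(words) for dim, words in words_per_dimension.items()}

words = {"x": ["abc", "abc", "bcd"], "y": ["abc"]}
merged = single_bag(words)
separate = bag_per_dimension(words)
print(merged, separate)
```

In the first variant, the prefix keeps identical words from different dimensions apart, so a single weight vector per class is sufficient; the second variant needs one weight vector per class and per dimension.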